tidyverse I: dplyr; gapminder
tidyverse II: readr, ggplot2; Public Data, WDI, WIR, etc.
tidyverse III: tidyr, etc.; WDI, WIR, etc.
tidyverse IV; WDI, WIR, etc.
bookdown site: https://bookdown.org
coursera courses
learnr
Data science: An interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing.
The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
A GNU package, the official R software environment is written primarily in C, Fortran, and R itself (thus, it is partially self-hosting) and is freely available under the GNU General Public License.
“R Programming for Data Science” by Roger Peng
When you talk about choosing programming languages, I always say you shouldn’t pick them based on technical merits, but rather pick them based on the community. And I think the R community is like really, really strong, vibrant, free, welcoming, and embraces a wide range of domains. So, if there are like people like you using R, then your life is going to be much easier. That’s the first reason.
Interview: “Advice to Young (and Old) Programmers, H. Wickham”
RStudio is an integrated development environment, or IDE, for R programming.
RStudio is an integrated development environment (IDE) for R, a programming language for statistical computing and graphics. It is available in two formats: RStudio Desktop is a regular desktop application while RStudio Server runs on a remote server and allows accessing RStudio using a web browser.
To download R, go to CRAN, the comprehensive R archive network. CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages. Don’t try and pick a mirror that’s close to you: instead use the cloud mirror, https://cloud.r-project.org, which automatically figures it out for you.
A new major version of R comes out once a year, and there are 2-3 minor releases each year. It’s a good idea to update regularly.
Download and install it from http://www.rstudio.com/download.
RStudio is updated a couple of times a year. When a new version is available, RStudio will let you know.
Or,
In this way the working directory of the session is set to the project directory, and R can find related files without difficulty (see getwd() and setwd()).
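As a minimal sketch of what the working directory means in practice: getwd() reports it, and relative paths are resolved against it. The file name data/example.csv is hypothetical and used only for illustration; the file does not need to exist for this code to run.

```r
# Report the current working directory
# (the project directory when an RStudio project is open).
wd <- getwd()
print(wd)

# A relative path is resolved against the working directory.
# "data/example.csv" is a hypothetical file name used only for illustration.
rel_path <- file.path("data", "example.csv")
abs_path <- file.path(wd, rel_path)
print(abs_path)
```

With a project, setwd() is rarely needed, because the working directory is already set when the project is opened.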
RStudio Cloud (now called Posit Cloud) is a lightweight, cloud-based solution that allows anyone to do, share, teach and learn data science online.
Start RStudio and create a project, or login to Posit Cloud and create a project.
Enter the following code into the Console in the bottom-left pane.
head(cars)
str(cars)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
summary(cars)
speed dist
Min. : 4.0 Min. : 2.00
1st Qu.:12.0 1st Qu.: 26.00
Median :15.0 Median : 36.00
Mean :15.4 Mean : 42.98
3rd Qu.:19.0 3rd Qu.: 56.00
Max. :25.0 Max. :120.00
plot(cars)
plot(cars) # cars: Speed and Stopping Distances of Cars
abline(lm(cars$dist~cars$speed))
lm(cars$dist~cars$speed)
Call:
lm(formula = cars$dist ~ cars$speed)
Coefficients:
(Intercept) cars$speed
-17.579 3.932
summary(lm(cars$dist~cars$speed))
Call:
lm(formula = cars$dist ~ cars$speed)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
head(cars): The first 6 rows of the pre-installed data cars.
str(cars): The data structure of the pre-installed data cars.
summary(cars): The summary of the pre-installed data cars.
plot(cars): A scatter plot of the pre-installed data cars.
plot(cars$dist~cars$speed)
cars$dist, cars[[2]], and cars[,2] are the same.
abline(lm(cars$dist~cars$speed)): Add a regression line of a linear model.
lm(cars$dist~cars$speed): The equation of the regression line.
summary(lm(cars$dist~cars$speed)): The summary of the linear regression model.
hist(cars$dist)
hist(cars$speed)
View(cars)
?cars: same as help(cars)
??cars: same as help.search("cars")
datasets
?datasets
library(help = "datasets")
data() shows all data sets that are already attached and available.
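The three ways of referring to the dist column of cars mentioned above can be checked directly. This is a small sketch using only base R and the built-in cars data; the variable names are chosen here for illustration.

```r
# Three ways to extract the second column (dist) of the built-in cars data.
by_name  <- cars$dist   # by column name
by_index <- cars[[2]]   # by position, as a vector
by_comma <- cars[, 2]   # row/column indexing, all rows of column 2

# identical() checks that two objects are exactly the same.
print(identical(by_name, by_index))
print(identical(by_name, by_comma))
```

Note that this equivalence holds for a base data.frame. For a tibble, the `[ , 2]` form keeps returning a one-column tibble rather than a vector.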
Pick a data set in the datasets package and try head(), str(), summary(), and some more.
iris
head(iris)
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Can you plot?
plot(iris$Sepal.Length, iris$Sepal.Width)
tidyverse Packages
Sys.setenv(LANG = "en")
dir.create("data")
basics.R
coronavirus.R
To run the code: at the cursor press Ctrl+Shift+Enter (Win) or Cmd+Shift+Enter (Mac).
R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardised collection format that can be installed by users of R, typically via a centralised software repository such as CRAN (the Comprehensive R Archive Network).
You can install packages by “Install Packages…” under “Tools” in the top menu.
install.packages("tidyverse")
install.packages("rmarkdown")
Choose R Notebook from the pull-down File menu in the top bar.
The default is as follows
---
title: "R Notebook"
output: html_notebook
---
Template
---
title: "Title of R Notebook"
author: "ID and Your Name"
date: "2023-01-08"
output:
  html_notebook:
    # number_sections: yes
    # toc: true
    # toc_float: true
---
number_sections: yes - default is number_sections: no.
toc: true - default is toc: false.
toc_float: true - default is toc_float: false.
Insert Chunk in the Code pull-down menu in the top bar, or use the C button on top. You can use shortcut keys listed under Tools in the top bar.
library(tidyverse)
Let us assign the iris data in the pre-installed package datasets to df_iris. You can give any name starting with a letter, though there are some rules.
df_iris <- datasets::iris
class(df_iris)
[1] "data.frame"
The class of data iris is data.frame, the
basic data class of R. You can assign the same data as a
tibble, the data class of tidyverse as
follows.
tbl_iris <- as_tibble(datasets::iris)
class(tbl_iris)
[1] "tbl_df" "tbl" "data.frame"
df_iris <- iris can replace df_iris <- datasets::iris because the package datasets is installed and attached by default. Since you may have other data called iris included in a different package, or you may have changed iris before, it is safer to specify the name of the package with the name of the data.
df_iris and tbl_iris behave differently. This is because of the default settings of R Markdown.
The View command opens a window to show the contents of the data, and you can use the filter as well.
View(df_iris)
The following simple command also shows the data.
df_iris
The output within R Notebook is displayed in tibble style. Try the same command in the Console.
slice(df_iris, 1:10)
1:10
[1] 1 2 3 4 5 6 7 8 9 10
Let us look at the structure of the data. You can try str(df_iris) in the Console or by adding a code chunk in R Notebook, introduced later.
glimpse(df_iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8…
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3.0…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.6, 1.4…
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.2, 0.1…
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, setosa,…
There are six basic types of data in R: Double, Integer, Character, Logical, Raw, and Complex.
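As a quick illustration of these six types, the sketch below builds one example value of each and asks typeof() for its storage type. The example values and the name examples are chosen here only for illustration.

```r
# One example value for each of the six basic types.
examples <- list(
  double    = 3.14,
  integer   = 2L,          # the L suffix makes an integer
  character = "iris",
  logical   = TRUE,
  raw       = as.raw(255),
  complex   = 1 + 2i
)

# typeof() reports the internal storage type of each value.
types <- sapply(examples, typeof)
print(types)
```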
The names after $ are column names. If you call df_iris$Species, you get the Species column. Species is in the 5th column, so typeof(df_iris[[5]]) gives the same result as typeof(df_iris$Species).
df_iris[2,4] = 0.2 is the second entry of Petal.Width (row 2, column 4).
typeof(df_iris$Species)
[1] "integer"
class(df_iris$Species)
[1] "factor"
For factors = fct see the
R Document or an explanation in Factor
in R: Categorical Variable & Continuous Variables.
typeof(df_iris$Sepal.Length)
[1] "double"
class(df_iris$Sepal.Length)
[1] "numeric"
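The row and column indexing and the factor type discussed above can be checked with a short sketch. It uses only base R; the variable names a, b and c4 are chosen here for illustration.

```r
df_iris <- datasets::iris

# Row 2, column 4: the second entry of Petal.Width, reached in three ways.
a  <- df_iris[2, 4]
b  <- df_iris$Petal.Width[2]
c4 <- df_iris[[4]][2]
print(c(a, b, c4))

# A factor stores integer codes together with a set of labels (levels),
# which is why typeof() is "integer" while class() is "factor".
print(levels(df_iris$Species))
print(head(as.integer(df_iris$Species)))
```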
Q1. What are the differences among df_iris, slice(df_iris, 1:10) and glimpse(df_iris) above?
Q2. What are the differences among df_iris, slice(df_iris, 1:10) and glimpse(df_iris) in the console?
The following is very convenient for getting summary information about a data set.
summary(df_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Minimum, 1st Quartile (25%), Median, Mean, 3rd Quartile (75%), Maximum, and the count of each factor level.
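The 25% and 75% values reported by summary() can also be obtained with quantile(). The following sketch compares the two for one column, using base R only; the names x, q and s are chosen here for illustration.

```r
x <- datasets::iris$Sepal.Length

# quantile() returns the minimum, the 25%, 50%, 75% points and the maximum by default.
q <- quantile(x)
print(q)

# summary() reports the same five values plus the mean.
s <- summary(x)
print(s)
```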
We use ggplot to draw graphs. The scatter plot is a
projection of data with two variables \(x\) and \(y\).
ggplot(data = <data>, aes(x = <column name for x>, y = <column name for y>)) +
geom_point()
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
Add a title and axis labels with labs().
ggplot(data = <data>, aes(x = <column name for x>, y = <column name for y>)) +
geom_point() +
labs(title = "Title", x = "Label for x", y = "Label for y")
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
labs(title = "Scatter Plot of Sepal Data of Iris", x = "Sepal Length", y = "Sepal Width")
Add different colors automatically to each species. Can you see each group?
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point()
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width, shape = Species)) +
geom_point()
The boxplot compactly displays the distribution of a continuous variable.
ggplot(data = df_iris, aes(x = Species, y = Sepal.Length)) +
geom_boxplot()
Visualize the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin. Histograms (geom_histogram()) display the counts with bars.
ggplot(data = df_iris, aes(x = Sepal.Length)) +
geom_histogram()
Change the number of bins by bins = <number>.
ggplot(data = df_iris, aes(x = Sepal.Length)) +
geom_histogram(bins = 10)
Professor Kaizoji will cover the mathematical models and hypothesis testing.
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
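The straight line drawn by geom_smooth(method = "lm") is a fitted linear model of y on x. As a sketch, the same model can be fitted with lm() in base R to see its intercept and slope; the name fit is chosen here for illustration.

```r
df_iris <- datasets::iris

# Fit the linear model behind the smoothing line:
# Sepal.Width (y) as a linear function of Sepal.Length (x).
fit <- lm(Sepal.Width ~ Sepal.Length, data = df_iris)

# Intercept and slope of the fitted line.
print(coef(fit))
```

The sign of the slope tells you whether the line in the plot goes up or down as Sepal.Length increases.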
swirl website: https://swirlstats.com
swirl for exercises.
You can install other swirl courses as well:
install_course("Course Name Here")
install.packages("swirl") # Only the first time.
library(swirl) # Every time you start swirl
swirl() # Every time you start or resume swirl
1: Basic Building Blocks 2: Workspace and Files 3: Sequences of Numbers
4: Vectors 5: Missing Values 6: Subsetting Vectors
7: Matrices and Data Frames 8: Logic 9: Functions
10: lapply and sapply 11: vapply and tapply 12: Looking at Data
13: Simulation 14: Dates and Times 15: Base Graphics
1, 3, 4, 5, 6, 7, 12, 15, 14, 8, 9, 10, 11, 13, 2
swirl Session
... <-- That’s your cue to press Enter to continue
You can exit swirl and return to the R prompt (>) at any time by pressing the Esc key.
If you are already at the prompt, type bye() to exit and save your progress. When you exit properly, you’ll see a short message letting you know you’ve done so.
When you are at the R prompt (>):
You will encounter a message like ‘Would you like to receive credit for completing this course on Coursera.org?’ at the end of each course. This is for Coursera courses. Select ‘NO’.
basics.R
The script with the outputs.
```r
#################
#
# basics.R
#
################
# 'Quick R' by DataCamp may be a handy reference:
#     https://www.statmethods.net/management/index.html
# Cheat Sheet at RStudio: https://www.rstudio.com/resources/cheatsheets/
# Base R Cheat Sheet: https://github.com/rstudio/cheatsheets/raw/main/base-r.pdf
# To execute the line: Control + Enter (Windows and Linux), Command + Enter (Mac)
## try your experiments on the console
3 + 7
3 + 10 / 2
3^2
2^3
2*2*2
x <- 5
x
this_is_a_long_name <- 5^3
this_is_a_long_name
char_name <- "What is your name?"
char_name
ls()
ls.str()
5:10
a <- seq(5,10)
a
b <- 5:10
identical(a,b)
seq(5,10,2) # same as seq(from = 5, to = 10, by = 2)
c1 <- seq(0,100, by = 10)
c2 <- seq(0,100, length.out = 10)
c1
c2
length(c1)
(die <- 1:6)
zero_one <- c(0,1) # same as 0:1
die + zero_one # c(1,2,3,4,5,6) + c(0,1). re-use (the shorter vector is recycled)
d1 <- rep(1:3,2) # repeat
d1
die == d1
d2 <- as.character(die == d1)
d2
d3 <- as.numeric(die == d1)
d3
typeof(d1); class(d1)
typeof(d2); class(d2)
typeof(d3); class(d3)
sqrt(2)
sqrt(2)^2
sqrt(2)^2 - 2
typeof(sqrt(2))
typeof(2)
typeof(2L)
5 == c(5)
length(5)
(A_Z <- LETTERS)
A_F <- A_Z[1:6]
A_F
A_F[3]
A_F[c(3,5)]
large <- die > 3
large
even <- die %in% c(2,4,6)
even
A_F[large]
A_F[even]
A_F[die < 4]
plot(cars)
?cars
data()
help(plot)
?par
head(cars)
str(cars)
summary(cars)
x <- cars$speed
y <- cars$dist
min(x)
mean(x)
quantile(x)
plot(cars)
abline(lm(cars$dist ~ cars$speed))
summary(lm(cars$dist ~ cars$speed))
boxplot(cars)
hist(cars$speed)
hist(cars$dist)
hist(cars$dist, breaks = seq(0,120, 10))
```
---
### coronavirus.R
The script and its outputs.
__coronavirus.csv__ is very large
```r
# https://coronavirus.jhu.edu/map.html
# JHU Covid-19 global time series data
# See R package coronavirus at: https://github.com/RamiKrispin/coronavirus
# Data taken from: https://github.com/RamiKrispin/coronavirus/tree/master/csv
# Last Updated
Sys.Date()
## Download and read csv (comma separated value) file
coronavirus <- read.csv("https://github.com/RamiKrispin/coronavirus/raw/master/csv/coronavirus.csv")
# write.csv(coronavirus, "data/coronavirus.csv")
## Summaries and structures of the data
head(coronavirus)
str(coronavirus)
coronavirus$date <- as.Date(coronavirus$date)
str(coronavirus)
range(coronavirus$date)
unique(coronavirus$country)
unique(coronavirus$type)
## Set Country
COUNTRY <- "Japan"
df0 <- coronavirus[coronavirus$country == COUNTRY,]
head(df0)
tail(df0)
(pop <- df0$population[1])
df <- df0[c(1,6,7,13)]
str(df)
head(df)
### alternatively,
head(df0[c("date", "type", "cases", "population")])
###
## Set types
df_confirmed <- df[df$type == "confirmed",]
df_death <- df[df$type == "death",]
df_recovery <- df[df$type == "recovery",]
head(df_confirmed)
head(df_death)
head(df_recovery)
## Histogram
plot(df_confirmed$date, df_confirmed$cases, type = "h")
plot(df_death$date, df_death$cases, type = "h")
# plot(df_recovered$date, df_recovered$cases, type = "h") # no data for recovery
## Scatter plot and correlation
plot(df_confirmed$cases, df_death$cases, type = "p")
cor(df_confirmed$cases, df_death$cases)
## In addition set a period
start_date <- as.Date("2021-07-01")
end_date <- Sys.Date()
df_date <- df[df$date >= start_date & df$date <= end_date,]
##
## Set types
df_date_confirmed <- df_date[df_date$type == "confirmed",]
df_date_death <- df_date[df_date$type == "death",]
df_date_recovery <- df_date[df_date$type == "recovery",]
head(df_date_confirmed)
head(df_date_death)
head(df_date_recovery)
## Histogram
plot(df_date_confirmed$date, df_date_confirmed$cases, type = "h")
plot(df_date_death$date, df_date_death$cases, type = "h")
# plot(df_date_recovered$date, df_date_recovered$cases, type = "h") # no data for recovery
plot(df_date_confirmed$cases, df_date_death$cases, type = "p")
cor(df_date_confirmed$cases, df_date_death$cases)
### Q0. Change the values of the location and the period and see the outcomes.
### Q1. What is the correlation between df_confirmed$cases and df_death$cases?
### Q2. Do we have a larger correlation value if we shift the dates to implement the time-lag?
### Q3. Do you have any other questions to explore?
#### Extra
plot(df_confirmed$date, df_confirmed$cases, type = "h",
     main = paste("Confirmed Cases in",COUNTRY),
xlab = "Date", ylab = "Number of Cases")
```
gapminder Package
Hans Rosling was a Swedish physician, academic, and public speaker. He was a professor of international health at Karolinska Institute and was the co-founder and chairman of the Gapminder Foundation, which developed the Trendalyzer software system. (Wikipedia)
recognizing when a decision feels urgent and remembering that it rarely is.
To control the urgency instinct, take small steps.
# install.packages("gapminder")
library(gapminder)
df <- gapminder
df
glimpse(df)
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afgha…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.822, 41…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12881816…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, 978.01…
summary(df)
country continent year lifeExp pop
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60 Min. :6.001e+04
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20 1st Qu.:2.794e+06
Algeria : 12 Asia :396 Median :1980 Median :60.71 Median :7.024e+06
Angola : 12 Europe :360 Mean :1980 Mean :59.47 Mean :2.960e+07
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85 3rd Qu.:1.959e+07
Australia : 12 Max. :2007 Max. :82.60 Max. :1.319e+09
(Other) :1632
gdpPercap
Min. : 241.2
1st Qu.: 1202.1
Median : 3531.8
Mean : 7215.3
3rd Qu.: 9325.5
Max. :113523.1
tidyverse
rmarkdown
gapminder
EDA from r4ds
Today: R Markdown and dplyr
What is R Markdown: https://vimeo.com/178485416
R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both
R Notebooks are an implementation of Literate Programming that allows for direct interaction with R while producing a reproducible document with publication-quality output.
An R Notebook is an R Markdown document with chunks that can be executed independently and interactively, with output visible immediately beneath the input.
(Reference: R Markdown: The Definitive Guide, 3.2 Notebook)
Important: Implementation of Reproducible Research and Literate Programming
Useful to Render into Various Formats: R Notebook (HTML), R Markdown (HTML), PDF, MS Word, MS Powerpoint, Ioslides Presentation (HTML), Slidy Presentation (HTML), Beamer Presentation (PDF), etc.
Literate programming is an approach to programming introduced by Donald Knuth in which a program is given as an explanation of the program logic in a natural language, such as English, interspersed with snippets of macros and traditional source code, from which a compilable source code can be generated
Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.
Reproducible research is the idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them. The need for reproducibility is increasing dramatically as data analyses become more complex, involving larger datasets and more sophisticated computations. Reproducibility allows for people to focus on the actual content of a data analysis, rather than on superficial details reported in a written summary. In addition, reproducibility makes an analysis more useful to others because the data and code that actually conducted the analysis are available.
R Markdown is also important because it so tightly integrates prose and code. This makes it a great analysis notebook because it lets you develop code and record your thoughts. It:
Records what you did and why you did it. Regardless of how great your memory is, if you don’t record what you do, there will come a time when you have forgotten important details. Write them down so you don’t forget!
Supports rigorous thinking. You are more likely to come up with a strong analysis if you record your thoughts as you go, and continue to reflect on them. This also saves you time when you eventually write up your analysis to share with others.
Helps others understand your work. It is rare to do data analysis by yourself, and you’ll often be working as part of a team. A lab notebook helps you share why you did it with your colleagues or lab mates.
rmarkdown
install.packages("rmarkdown")
tinytex (for pdf generation)
install.packages('tinytex')
tinytex::install_tinytex() # install TinyTeX
Terminal in the bottom-left pane:
plot(cars) and then Preview again.
Template to submit your assignment of this course: RNotebook_Template.nb.html
title: "Title of R Notebook"
author: "ID and Your Name"
date: "2023-01-08"
output:
  html_notebook: null
Various Output Formats: test-rmarkdown.nb.html
title: "Testing R Markdown Formats"
author: "DS-SL"
date: "2023-01-08"
output:
  html_notebook:
    number_sections: yes
  pdf_document:
    number_sections: yes
  html_document:
    df_print: paged
    number_sections: yes
  word_document:
    number_sections: yes
  powerpoint_presentation: default
  ioslides_presentation:
    widescreen: yes
    smaller: yes
  slidy_presentation: default
  beamer_presentation: default
--- is a page break for presentation formats.
Use ref-doc-style.docx as reference_doc in YAML, with indentation as below:
  word_document:
    number_sections: yes
    reference_doc: ref-doc-style.docx
  powerpoint_presentation:
    reference_doc: ref-ppt-style.pptx
Output Options at the bottom of the gear icon next to the Preview/Knit button.
$\frac{a}{b}$ for \(\frac{a}{b}\)
Italic text by _italic_, bold text by **bold**
RStudio introduced the Visual Editor towards the end of 2021. It seems to be stable, but switching back and forth between it and the original editor using tags is not perfect. I always use the original editor and I am confident with all of its functions, but I do not have much experience with the Visual Editor. [My Note in QALL401 2021]
dplyr
dplyr Overview
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
select() picks variables based on their names.
filter() picks cases based on their values.
mutate() adds new variables that are functions of existing variables.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
group_by() takes an existing tbl and converts it into a grouped tbl.
You can learn more about them in vignette("dplyr"). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette("two-table").
If you are new to dplyr, the best place to start is the data transformation chapter in R for data science.
select: Subset columns using their names and types

| Helper Function | Use | Example |
|---|---|---|
| - | Columns except | select(babynames, -prop) |
| : | Columns between (inclusive) | select(babynames, year:n) |
| contains() | Columns that contain a string | select(babynames, contains("n")) |
| ends_with() | Columns that end with a string | select(babynames, ends_with("n")) |
| matches() | Columns that match a regex | select(babynames, matches("n")) |
| num_range() | Columns with a numerical suffix in the range | Not applicable with babynames |
| one_of() | Columns whose names appear in the given set | select(babynames, one_of(c("sex", "gender"))) |
| starts_with() | Columns that start with a string | select(babynames, starts_with("n")) |
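The helper functions in the table can be tried without installing babynames. The sketch below uses the built-in iris data instead, which is an assumption made here for convenience; the variable names are chosen for illustration.

```r
library(dplyr)

df_iris <- datasets::iris

# Columns whose names start with "Sepal".
sepal_cols <- select(df_iris, starts_with("Sepal"))
print(names(sepal_cols))

# Columns whose names end with "Width".
width_cols <- select(df_iris, ends_with("Width"))
print(names(width_cols))

# All columns except Species, and a range of adjacent columns.
no_species <- select(df_iris, -Species)
range_cols <- select(df_iris, Sepal.Length:Petal.Length)
print(names(range_cols))
```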
filter: Subset rows using column values

| Logical operator | Tests | Example |
|---|---|---|
| > | Is x greater than y? | x > y |
| >= | Is x greater than or equal to y? | x >= y |
| < | Is x less than y? | x < y |
| <= | Is x less than or equal to y? | x <= y |
| == | Is x equal to y? | x == y |
| != | Is x not equal to y? | x != y |
| is.na() | Is x an NA? | is.na(x) |
| !is.na() | Is x not an NA? | !is.na(x) |
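A short sketch of these tests inside filter(), using the built-in airquality data because it contains missing values. The choice of data set and the variable names are assumptions made here for illustration.

```r
library(dplyr)

aq <- datasets::airquality

# Keep rows where Ozone is not missing.
ozone_known <- filter(aq, !is.na(Ozone))

# Several conditions separated by commas are combined with "and".
hot_may <- filter(aq, Month == 5, Temp >= 70)

print(nrow(aq))
print(nrow(ozone_known))
print(nrow(hot_may))
```

Note that x == NA does not test for missing values, since any comparison with NA returns NA. Use is.na(x) instead.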
arrange and Pipe %>%
arrange() orders the rows of a data frame by the values of selected columns.
Unlike other dplyr verbs, arrange() largely ignores grouping; you need to explicitly mention grouping variables (or use .by_group = TRUE) in order to group by them, and functions of variables are evaluated once per data frame, not once per group.
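The effect of .by_group can be seen in a small sketch with the built-in iris data. The variable names are chosen here for illustration.

```r
library(dplyr)

df_iris <- datasets::iris

# arrange() on a grouped data frame still sorts across all rows.
ungrouped_sort <- df_iris %>%
  group_by(Species) %>%
  arrange(desc(Sepal.Length))

# With .by_group = TRUE the rows are sorted by Species first,
# and then by Sepal.Length within each species.
grouped_sort <- df_iris %>%
  group_by(Species) %>%
  arrange(desc(Sepal.Length), .by_group = TRUE)

print(head(ungrouped_sort, 3))
print(head(grouped_sort, 3))
```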
pipes in R for Data Science.
mutate
Create, modify, and delete columns
Useful mutate functions
+, -, log(), etc., for their usual mathematical meanings
lead(), lag()
dense_rank(), min_rank(), percent_rank(), row_number(), cume_dist(), ntile()
cumsum(), cummean(), cummin(), cummax(), cumany(), cumall()
na_if(), coalesce()
### group_by() and summarise()
group_by
summarise or summarize
So far our summarise() examples have relied on sum(), max(), and mean(). But you can use any function in summarise() so long as it meets one criterion: the function must take a vector of values as input and return a single value as output. Functions that do this are known as summary functions and they are common in the field of descriptive statistics. Some of the most useful summary functions include:
dplyr by Examples
iris
iris
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
select 1 - columns 1, 2, 5
select(iris, c(1,2,5))
select 2 - except Species
select(iris, -Species)
select 3 - change column names
select(iris, sl = Sepal.Length, sw = Sepal.Width, sp = Species)
filter - by names
filter(iris, Species == "virginica")
arrange - ascending and descending order
arrange(iris, Sepal.Length, desc(Sepal.Width))
mutate - rank
iris %>% mutate(sl_rank = min_rank(Sepal.Length)) %>% arrange(sl_rank)
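Beyond ranking, several of the functions listed under "Useful mutate functions" can be combined in one mutate() call. The sketch below uses a small made-up data frame; the name sales and its values are hypothetical and serve only as an illustration.

```r
library(dplyr)

# A small illustrative data frame (values are made up for this example).
sales <- data.frame(day = 1:5, amount = c(10, 20, 15, 30, 25))

result <- sales %>%
  mutate(
    previous  = lag(amount),            # value from the previous row (NA for the first row)
    change    = amount - lag(amount),   # difference from the previous row
    running   = cumsum(amount),         # cumulative total
    rank_desc = min_rank(desc(amount))  # 1 = largest amount
  )
print(result)
```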
group_by and summarize
iris %>%
  group_by(Species) %>%
  summarize(sl = mean(Sepal.Length), sw = mean(Sepal.Width),
            pl = mean(Petal.Length), pw = mean(Petal.Width))
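When a column contains missing values, summary functions such as mean() return NA unless na.rm = TRUE is given. The following sketch uses the built-in airquality data, an assumption made here because iris has no missing values; the column names in the result are chosen for illustration.

```r
library(dplyr)

aq <- datasets::airquality

by_month <- aq %>%
  group_by(Month) %>%
  summarise(
    n_days       = n(),                         # number of rows in the group
    mean_ozone   = mean(Ozone, na.rm = TRUE),   # mean, ignoring missing values
    median_ozone = median(Ozone, na.rm = TRUE), # median, ignoring missing values
    n_missing    = sum(is.na(Ozone))            # count of missing Ozone values
  )
print(by_month)
```

Each function inside summarise() takes the vector of values for one group and returns a single value, so the result has one row per month.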
mean() or mean(x, na.rm = TRUE) - arithmetic mean (average)
median() or median(x, na.rm = TRUE) - mid value
For more examples see dplyr
dplyr by Examples II - gapminder
ggplot2 Overview
ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
Examples
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = class, y = hwy))
Template
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
gapminder
Gapminder was founded by Ola Rosling, Anna Rosling Rönnlund, and Hans Rosling
Gapminder: https://www.gapminder.org
R Package gapminder by Jennifer Bryan
Package Help ?gapminder or gapminder in
the search window of Help
library(tidyverse)
library(gapminder)
library(WDI)
gapminder data
df <- gapminder
df
glimpse(df)
Rows: 1,704
Columns: 6
$ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afgha…
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
$ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, …
$ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.822, 41…
$ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12881816…
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, 978.01…
summary(df)
country continent year lifeExp pop
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60 Min. :6.001e+04
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20 1st Qu.:2.794e+06
Algeria : 12 Asia :396 Median :1980 Median :60.71 Median :7.024e+06
Angola : 12 Europe :360 Mean :1980 Mean :59.47 Mean :2.960e+07
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85 3rd Qu.:1.959e+07
Australia : 12 Max. :2007 Max. :82.60 Max. :1.319e+09
(Other) :1632
gdpPercap
Min. : 241.2
1st Qu.: 1202.1
Median : 3531.8
Mean : 7215.3
3rd Qu.: 9325.5
Max. :113523.1
ggplot(df, aes(x = year, y = lifeExp)) + geom_point()
ggplot(df, aes(x = year, y = lifeExp)) + geom_line()
ggplot(df, aes(x = year, y = lifeExp)) + geom_boxplot()
typeof(pull(df, year)) # same as typeof(df$year)
[1] "integer"
ggplot(df, aes(y = lifeExp, group = year)) + geom_boxplot()
ggplot(df, aes(x = as_factor(year), y = lifeExp)) + geom_boxplot()
dplyr
filter
df %>% filter(country == "Afghanistan") %>%
  ggplot(aes(x = year, y = lifeExp)) + geom_line()
df %>% filter(country %in% c("Afghanistan", "Japan")) %>%
ggplot(aes(x = year, y = lifeExp, color = country)) + geom_line()
df %>% distinct(country) %>% pull()
[1] Afghanistan Albania Algeria
[4] Angola Argentina Australia
[7] Austria Bahrain Bangladesh
[10] Belgium Benin Bolivia
[13] Bosnia and Herzegovina Botswana Brazil
[16] Bulgaria Burkina Faso Burundi
[19] Cambodia Cameroon Canada
[22] Central African Republic Chad Chile
[25] China Colombia Comoros
[28] Congo, Dem. Rep. Congo, Rep. Costa Rica
[31] Cote d'Ivoire Croatia Cuba
[34] Czech Republic Denmark Djibouti
[37] Dominican Republic Ecuador Egypt
[40] El Salvador Equatorial Guinea Eritrea
[43] Ethiopia Finland France
[46] Gabon Gambia Germany
[49] Ghana Greece Guatemala
[52] Guinea Guinea-Bissau Haiti
[55] Honduras Hong Kong, China Hungary
[58] Iceland India Indonesia
[61] Iran Iraq Ireland
[64] Israel Italy Jamaica
[67] Japan Jordan Kenya
[70] Korea, Dem. Rep. Korea, Rep. Kuwait
[73] Lebanon Lesotho Liberia
[76] Libya Madagascar Malawi
[79] Malaysia Mali Mauritania
[82] Mauritius Mexico Mongolia
[85] Montenegro Morocco Mozambique
[88] Myanmar Namibia Nepal
[91] Netherlands New Zealand Nicaragua
[94] Niger Nigeria Norway
[97] Oman Pakistan Panama
[100] Paraguay Peru Philippines
[103] Poland Portugal Puerto Rico
[106] Reunion Romania Rwanda
[109] Sao Tome and Principe Saudi Arabia Senegal
[112] Serbia Sierra Leone Singapore
[115] Slovak Republic Slovenia Somalia
[118] South Africa Spain Sri Lanka
[121] Sudan Swaziland Sweden
[124] Switzerland Syria Taiwan
[127] Tanzania Thailand Togo
[130] Trinidad and Tobago Tunisia Turkey
[133] Uganda United Kingdom United States
[136] Uruguay Venezuela Vietnam
[139] West Bank and Gaza Yemen, Rep. Zambia
[142] Zimbabwe
142 Levels: Afghanistan Albania Algeria Angola Argentina Australia Austria ... Zimbabwe
df %>% filter(country %in% c("Brazil", "Russia", "India", "China")) %>%
ggplot(aes(x = year, y = lifeExp, color = country)) + geom_line()
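Russian data is missing: "Russia" does not appear in the gapminder country list above, so only Brazil, India, and China are plotted.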
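One way to check in advance which requested names are absent is to compare them with the country levels. This is a small sketch; the object names bric and missing_countries are introduced here only for illustration.

```r
library(gapminder)

bric <- c("Brazil", "Russia", "India", "China")

# Names requested but not present among the gapminder country levels
missing_countries <- setdiff(bric, levels(gapminder$country))
missing_countries
```

filter() with %in% does not warn about names that match nothing, so a check like this helps catch spelling differences and absent countries.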
Exercise: change lifeExp to pop and gdpPercap and do the same.
group_by and summarize
Let us use the variable continent and summarize the data.
df_lifeExp <- df %>% group_by(continent, year) %>%
  summarize(mean_lifeExp = mean(lifeExp),
            median_lifeExp = median(lifeExp),
            max_lifeExp = max(lifeExp),
            min_lifeExp = min(lifeExp),
            .groups = "keep")
df_lifeExp
df %>% filter(year %in% c(1952, 1987, 2007)) %>%
ggplot(aes(x=as_factor(year), y = lifeExp, fill = continent)) +
geom_boxplot()
df_lifeExp %>% ggplot(aes(x = year, y = mean_lifeExp, color = continent)) +
geom_line()
df_lifeExp %>% ggplot(aes(x = year, y = mean_lifeExp, color = continent, linetype = continent)) +
geom_line()
df_lifeExp %>% ggplot() +
geom_line(aes(x = year, y = mean_lifeExp, color = continent)) +
geom_line(aes(x = year, y = median_lifeExp, linetype = continent))
R Markdown and dplyr
a2_123456.nb.html)
a2_123456.Rmd, a2_123456.nb.html, a2_123456.nb.html to Moodle.
Pick data from the built-in datasets besides cars. (library(help = "datasets") or go to the site The R Datasets Package)
head(), str(), …, and create at least one chart using ggplot2 - Code Chunk.
library(tidyverse) in the first code chunk.
Load gapminder by library(gapminder).
pop or gdpPercap, or both, one country in the data, a group of countries in the data.
lifeExp.)
Due: 2023-01-09 23:59:00. Submit your R Notebook file in Moodle (The Second Assignment). Due on Monday!
gapminder
df_wdi <- WDI(
country = "all",
indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD")
)
df_wdi
df_wdi_extra <- WDI(
country = "all",
indicator = c(lifeExp = "SP.DYN.LE00.IN", pop = "SP.POP.TOTL", gdpPercap = "NY.GDP.PCAP.KD"),
extra = TRUE
)
df_wdi_extra
library(tidyverse)
library(gapminder)
(df <- gapminder)
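# Note: "Brunei" and "Laos" are not in the gapminder country list, so filter() returns no rows for them
asean <- c("Brunei", "Cambodia", "Laos", "Myanmar", "Philippines", "Indonesia", "Malaysia", "Singapore")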
df %>% filter(country %in% asean) %>%
ggplot(aes(x = year, y = gdpPercap, col = country)) + geom_line()
df %>% filter(country %in% asean) %>%
ggplot(aes(x = gdpPercap, y = lifeExp, col = country)) + geom_point()
library(ggrepel)
df2007 <- df %>% filter(country %in% asean, year == 2007)
df %>% filter(country %in% asean) %>%
ggplot(aes(x = gdpPercap, y = lifeExp, col = country))+
geom_line() + geom_label_repel(data = df2007, aes(label = country)) + geom_point() +
coord_trans(x = "log10", y = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust=1), legend.position = "none") +
labs(title = "Life Expectancy vs GDP Per Capita of ASEAN Countries",
subtitle = "Data: gapminder package", x = "GDP per Capita", y = "Life Expectancy")
df %>% filter(country %in% asean) %>%
ggplot(aes(x = gdpPercap, y = lifeExp, col = country)) +
geom_point() + coord_trans(x = "log10", y = "identity")
library(tidyverse)
library(maps)
world_map <- map_data("world")
df %>%
ggplot(aes(map_id = country)) +
geom_map(aes(fill = log10(gdpPercap)), map = world_map) + expand_limits(x = world_map$long, y = world_map$lat)
\(\log_{10}{100}\) = 2, \(\log_{10}{1000}\) = 3, \(\log_{10}{10000}\) = 4
\(10^{2.5}\) = 316.227766, \(10^{3}\) = 1000, \(10^{3.5}\) = 3162.2776602, \(10^{4}\) = 10000, \(10^{4.5}\) = 31622.7766017.
(x4 <- round(10^4, 1))
(x45 <- round(10^(4.5), 1))
31622.8, 10000
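The correspondence between a log10 legend value and the underlying GDP per capita can be checked directly in base R. The sketch below uses illustrative object names (exponents, values, log_table, back) that are not from the original notes.

```r
# Legend values on the log10 scale
exponents <- c(2, 2.5, 3, 3.5, 4, 4.5)

# Convert back to the original scale
values <- 10^exponents

log_table <- data.frame(exponent = exponents, value = round(values, 1))
print(log_table)

# log10() undoes the transformation
back <- log10(values)
```

Reading a map filled by log10(gdpPercap) therefore means that a step of 1 on the legend corresponds to a tenfold change in GDP per capita.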
EDA is an iterative cycle that helps you understand what your data says. When you do EDA, you:
Generate questions about your data
Search for answers by visualising, transforming, and/or modeling your data
Use what you learn to refine your questions and/or generate new questions
EDA is an important part of any data analysis. You can use EDA to make discoveries about the world; or you can use EDA to ensure the quality of your data, asking questions about whether the data meets your standards or not.
WDI
The term “Open Data” has a very precise meaning. Data or content is open if anyone is free to use, re-use or redistribute it, subject at most to measures that preserve provenance and openness.
WDI(country = "all",
indicator = "NY.GDP.PCAP.KD",
start = 1960,
end = 2020,
extra = FALSE,
cache = NULL)
c('women_private_sector' = 'BI.PWK.PRVS.FE.ZS')
library(WDI)
WDIsearch(string = "NY.GDP.PCAP.KD",
field = "indicator", cache = NULL)
WDIsearch(string = "NY.GDP.PCAP.KD",
field = "indicator", short = FALSE, cache = NULL)
WDIsearch(string = "gdp",
field = "name", short = TRUE, cache = NULL)
readr, readxl
readr, ggplot2; Public Data, WDI, WIR, etc
There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
The rest of this tutorial will look at these two questions. To make the discussion easier, let’s define some terms…
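A minimal sketch of the two kinds of question, assuming they concern variation within a single variable and covariation between two variables. The choice of the gapminder data for 2007 and of the variables lifeExp and gdpPercap is illustrative only.

```r
library(dplyr)
library(gapminder)

gm2007 <- gapminder %>% filter(year == 2007)

# Variation within one variable: how spread out is life expectancy?
variation <- gm2007 %>%
  summarise(min = min(lifeExp), median = median(lifeExp),
            max = max(lifeExp), sd = sd(lifeExp))
variation

# Covariation between two variables: do GDP per capita (log scale)
# and life expectancy move together?
covariation <- cor(log10(gm2007$gdpPercap), gm2007$lifeExp)
covariation
```

Numeric summaries like these are a starting point; a histogram or a scatter plot made with ggplot2 would address the same two questions visually.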
ggplot2 Basics
visualization
ggplot2 Extra
tidyr Basics
tidyr
tidyr, etc.; WDI, WIR, etc
Data Source
Variables
Problems
Visualization
Model
Conclusions and Further Research
WDI, WIR, etc
1.6 Comments on Week 2
1.6.0.1 Helpful Resources
Cheat Sheet in RStudio: https://www.rstudio.com/resources/cheatsheets/
‘Quick R’ by DataCamp: https://www.statmethods.net/management
An Introduction to R
1.6.0.2 Practicum
1.6.0.3 Assignments - See Moodle